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Abstract 

Because of their multimodality, mixture posterior distributions are difficult to sample 
with standard Markov chain Monte Carlo (MCMC) methods. We propose a strategy to 
enhance the sampling of MCMC in this context, using a biasing procedure which origi- 
nates from computational Statistical Physics. The principle is first to choose a "reaction 
coordinate" , that is, a "direction" in which the target distribution is multimodal. In 
a second step, the marginal log-density of the reaction coordinate with respect to the 
posterior distribution is estimated; minus this quantity is called "free energy" in the com- 
putational Statistical Physics literature. To this end, we use adaptive biasing Markov 
chain algorithms which adapt their targeted invariant distribution on the fly, in order to 
overcome sampling barriers along the chosen reaction coordinate. Finally, we perform an 
importance sampling step in order to remove the bias and recover the true posterior. The 
efficiency factor of the importance sampling step can easily be estimated a priori once 
the bias is known, and appears to be rather large for the test cases we considered. 

A crucial point is the choice of the reaction coordinate. One standard choice (used for 
example in the classical Wang-Landau algorithm) is minus the log-posterior density. We 
discuss other choices. We show in particular that the hyper-parameter that determines 
the order of magnitude of the variance of each component is both a convenient and an 
efficient reaction coordinate. 

We also show how to adapt the method to compute the evidence (marginal likelihood) 
of a mixture model. We illustrate our approach by analyzing two real data sets. 

Keywords: Adaptive Biasing Force; Adaptive Biasing Potential; Adaptive Markov chain 
Monte Carlo; Importance sampling; Mixture models. 



1 Introduction 



Mixture modeling is presuma bly the most popu l ar an d the most flexible way to m odel 



heterogeneous data; see e .g. Titterington et al. ( 19861 ). McLachlan and Peel ( 2000l ) and 



Friihwirth-Schnatter ( 20061 ) for an overview of applications of mixture models. Due to the 
emergence of Markov chain Monte Carlo (MCMC), intere st in the Bayesian analysis of such 
models has sharply increased in recent years, starting with lDiebolt and Robertl (119941) , How- 



ever, MCMC analysis of mixtures poses several problems (|Celeux et al.l . [2000J; iJasra et al. 
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20051 ). In this paper, we focus on the difficulties arising from the multimodality of the poste- 
rior distribution. 

For the sake of exposition, we concentrate our discussion on univariate Gaussian mixtures, 
but we explain in the conclusion (Section [6]) how our ideas may be extended to other mixture 
models. The data vector y = (j/i, . . . ,y n ) contains independent and identically distributed 
(i.i.d.) real random variables with density 



A 



p{y\Q) = Y[p(yi\Q), p{yi\Q) = Yl qk ^ x k x )' 



(i) 



i=l 



fc=l 



where the vector 6 contains all the unknown parameters, i.e. the mixture weights q\, ...,qx-i, 
the means /ii, . . . , \lk, the precisions Ai, . . . , Xk, and possibly hyper-parameters, and <£>(•; fi, A" 1 ) 
denotes the Gaussian density with mean \i and variance A -1 . 

This model is not identifiable, because both the likelihood and the posterior density are 
invariant by permutation of components, provided the prior is symmetric. This is the root 
of the aforementioned problems. By construction, any local mode of the posterior density 
admits K\ — l symmetric replicates, while a typical MCMC sampler recovers only one of these 
K\ copies. A possible reme dy is to introduce steps that permute randomly the components 

However, mix ture posteri o r dist ributions are often "genuinely 

(2005 ): Th e number of sets of K\ 



(jFruhwirth-Schnatterl . 12001 
multimodal" , following the terminology of iJasra et al 



equivalent modes is often larger than one; see also lMarin and Robertl (|2007l . Chap. 6) for an 
example of a multimodal posterior distribution generated by an identifiable mixture model, 



and Section 



Celeux et al 



2.21 of this p aper for an e xamp le based on real data. Therefore, and following 



(|200d ) and Ijasra et al.1 ([20051 ) . we consider that a minimum requirement of 
convergence for a MCMC sampler is to visit all possible labelling of the parameter, without 

resorting to random permutations. i 

' 2Ql3), 



Inspired by techniques used in molecular dynamics (jLelievre et al 



our aim is to 

develop samplers that meet this requirement, using an importance sampling strategy where 
the importance sampling function is the marginal distribution of a well chosen variable. The 
principle of the method is (i) to choose a "reaction coordinate" , namely a low-dimensional 
function of the parameters 9\ (ii) to compute the marginal log-density of this reaction coordi- 
nate (minus this log-density is called the "free energy" in the molecular dynamics literature); 
and (iii) to use the free energy to build a bias for the target of the MCMC sampler, in order to 
move more freely between the different modal regions of the initial target distribution. More 
precisely, the biased log-density is obtained by adding the free energy to the target log-density; 
this changes the marginal distribution of the reaction coordinate into a uniform distribution 
(within chosen bounds). At the final stage of the algorithm, the bias can be removed by 
performing a simple importance sampling step from the biased posterior distribution to the 
original unbiased posterior distribution. Expectations with respect to the posterior distribu- 
tion may be computed from the weighted sample. We also derive a method for computing 
the evidence (marginal likelihood). 

If the reaction coordinate is well chosen, it is likely that a MCMC sampler targeted at 
the biased posteri o r cony erges to equilibrium much faster than a standard MCMC sampler, 
see Lelievre et al. ( 20081 ). Two questions are then in order: How to choose the reaction 
coordinate? And how to compute the free energy? 

Concerning the choice of the reaction coordinate, the application of free energy based 
methods to mixture models is not straightforward. In many cases, samplers targeting a 
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mixture posterior distribution are metastable: This term means that the trajectory generated 
by the algorithm remains stuck in the vicinity of some local attraction point for very long 
times, before moving to a different region of the accessible space where it again remains 
stuck. We show that the degree of metastability of a sampler targeting a mixture posterior 
distribution is strongly determined by certain hyper-parameters, typically those that calibrate 
in the prior the spread of each Gaussian component. We show that such hyper-parameters 
are good reaction coordinates, in the sense that (i) it is possible to efficiently compute the 
associated free energy (by adaptive algorithms, see below), (ii) the corresponding free energy 
biased dynamics explores quickly the (biased) posterior distribution, and (hi) the points 
sampled from the biased distribution have non-negligible importance weights with respect to 
the original target posterior distribution. Other reaction coordinates are discussed, such as 
the posterior log-density, which is also a good reaction coordinate (with the problem however 
that an appropriate range of variation for this reaction coordinate, which is needed for the 
method, is difficult to determine a priori). T his reaction coordinate is the standar d cho ice for 
the W ang-Landau algorithm, see for instance Atchade and Liu ( 20ld ). Liang ( 20051 ) and Liang 
(l2010h 'or related works in Statistics. 



To compute the free energy , we resort to adaptive biasing algor i thms (jDarve and Pohorilld . 
200ll : Irlenin and Chipotl . 120041 : iMarsili et all . I2OO6I : iLehevre et all 120071 ) . In contrast with the 
adaptive MCMC algorithm s usually considered in the statistical literature (see the review of 
Andrieu and Thorns! ( 20081 ) and references therein), these adaptive algorithms modify sequen- 
tially the targeted invariant distribution of the Markov chain, instead of the parameters of 
the Markov kernel. Such al gorithms were initial ly designed to compute the free energy in 
molecular dynamics; see also Lelievre et al. ( 2O10l ) for a recent review of alternative standard 
techniques in molecular dynamics for computing the free energy, such as e.g. thermodynamic 
integration. 

It is of course possible to combine our approach with other s trategies aim ed at overcoming 



multim odality, such as tempering methods; see e.g. Ilbal (120011 ): iNeall (120011 ) and lCeleux et al 
(|2000h . 

The paper is organized as follows. Section [2] presents the univariate Gaussian mixture 
model, and the difficulties encountered with classical MCMC algorithms. Section [3] describes 
the method we propose, which is based on free energy biased dynamics. Section[4]explains how 
to apply this method to Bayesian inference on Gaussian mixture models. Section [5] illustrates 
our approach with two real data-sets. Section [6] discusses further research directions, in 
particular how our approach may be extended to other Bayesian models. 



2 Scientific context 



2.1 Gaussian mixture posterior distribution 

As explained in the introduction, we focus on the univariate Gaussian mix ture model (pQ), asso- 
ciated with the following prior, taken from iRichardson and Greenl (119971 ). which is symmetric 
with respect to the components k = 1, . . . , K: 

P 



N(m, k _1 ), 
Gamma(a, /3), 
Gamma(g, h), 
Dirichlet^(l, . . 



!)• 
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In our examples, we take m = M, k = A/R 2 , a = 2, g = 0.2, h = lO Oq / aR 2 , where R a nd M 
are respectively the range and the mean of the observed data, as in Ijasra et al.1 (l2005h . The 
posterior density reads 



p(o\y) = T=-p(e) P (y\o) 

Zr 

K K/2ghpKc+g-l 



(2) 



K 



a-l 



K 



K 



Z K T(a) K T(g)(27r\ 



n + K 



I A fc exp \ - - £> fe ~ Mf - p [h + ^ 



\k=i 



k=l 



n 

i=i 



^gfeA^expj-y^ -^ fc ) 2 | 



In this expression, 8 is the vector 

8 = (q 1 ,...,q K _ 1 , f , 1 ,..., f i K ,\ 1 ,...,\ K ,f3) en = S K xR K x (R + ) K+ \ (3) 

the set 

Sk = I (91, • • • , G J2 * ^ X } 

is the probability simplex, and 

(4) 



^ = / P(0)p(j/|0) ^ 
Jn 

is the normalizing constant (namely the marginal likelihood in y), which depends on K and y. 
The sampling of the posterior distribution with density (|2|) is the focus of the paper. 

2.2 Metastability in Statistical Physics 

In this section, we draw a parallel between the problem of sampling ([2]) and the problem 
of sampling Boltzmann- Gibbs probabi lity measures that arise in computational Statistical 
Physics; see for instance (Balian, 20071 ). We take this opportunity to introduce some of the 
concepts and terms of computational Statistical Physics that are relevant in our context. 

In computational Statistical Physics, one is often interested in computing average thermo- 
dynamic properties of the system under consideration. This requires sampling a Boltzmann- 
Gibbs (probability) measure 



7r(0) = iexp{-F(0)}, Z 



exp {-!/(#)} dB, 



(5) 



where 9 G C ffi d , and V is the potential of the system, assumed to be such that Z < oo. 
From now on, the term potential refers to the logarithm of a given (possibly unnormalized) 
probability density: e.g. V(9) = — log{p(9)p(y\9)} for the posterior density (J2j) . 

Probability densities such as the posterior density ([2]) for mixture models, or the Boltzmann 
Gibbs density ([5]) for systems in Statistical Physics, are ofte n multimodal. Standard sampling 
strat egies, for instance the Hastings-Metropolis algorithm ( Metropolis et al. . 19531 ; Hastings . 

generate samples which typically remain stuck in a small region around a local maxi- 
mum (also called local mode) of the sampled distribution (or, equivalently, a local minimum 
of the potential V). The sequences of samples generated by these algorithms are said to be 
metastable. 



4 



Figure Q] illustrates this phenomenon, in the Bayesian mixture context, for the poste- 
rior ((2|), with K = 3 components, and for two datasets (Fishery data, see Section [5TTT and 
Hidalgo stamps data, see Section HT2"]) . The posterior density admits at least K\ = 6 equiv- 
alent modes, but very few mode switchings (if any) are observed in the MCMC trajectories. 
A simple Gaussian random walk Hastings- Metropolis is used, but we obtained the same type 
of plots (not shown) with Gibbs sampling. 
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Figure 1: Left (resp. right) hand side corresponds to Fishery (resp. Hidalgo stamps) dataset. 
Top row: histograms of the data and an estimated 3-component Gaussian mixture probability 
density function (solid line). The dashed line corresponds to a local mode of the posterior den- 
sity. Bottom row: random walk Hastings-Metropolis trajectories of (^1,(12,^3) as a function 
of the number of iterations. See Section [5] for more details on the data and the sampler. 

The top right plot of Figure [1] also represents a Gaussian mixture probability density (in 
dashed line) which corresponds to a negligible (in terms of posterior probability) local mode 
of the posterior density for the Hidalgo dataset. In numerical tests not reported here, when 
this local mode is used as a starting point, the Hastings- Metropolis sampler used needs about 
T = 10 5 iterations to leave the attraction of this meaningless mode. It is easy to make this 
problem worse by slightly changing the data. For instance, T is increased to 10 7 by adding 2 to 
the three largest y^s (while this local mode remained of very small posterior probability). This 
local mode illustrates the typical "genuine multimodality" of mixture posteriors mentioned 
in the introduction - multimodality which cannot be cured by label permutation. 
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3 Free-energy biased sampling 



Consider a generic probability density ir(9), with 9 £ Q C M. d as denned in ([5]). The principle 
of free energy biased sampling (described more precisely in Section 13. 1 j) is to change the 
original density tt to the biased density: 

ir A (9) <x7r(6<)exp {(,4 o £)(£)}, 

where £ is some real-valued function 

C • ^ ^ [^min? ^max] 

where [z mm , z max ] C 1 is a bounded interval, (A o = A(^(8)), and the so-called free 

energy A (see definition ([6]) below) is such that the distribution of £(#) when 6 is distributed 
according to tta is uniform over the interval [z m \ n , z max ] . By sampling iTA(9)d8 rather than 
7r(#) d9, the aim is to remove the metastability in the direction of £. Averages with respect 
to the original distribution of interest 7r((9) d9 are finally obtained by standard importance 
sampling (see Section [373]) . We assume first that £(6>) takes values in a bounded interval 
[z m i n , z max ], (think of £(#) = q\ and [z m \ n , z max ] = [0, 1] for the mixture posterior distribution 
case), and postpone the discussion of how to treat reaction coordinates with values in an 
unbounded domain to Section 13.41 

An important part of the algorithm is to compute (an approximation of) the free energy A. 
There are many ways to this end. We describe a class of methods which are very efficient in 
the field of computational Statistical Physics (see Section 13. 2p and which, to our knowledge, 
have not been used so far in Statistics. 

3.1 Principle of the method 

Consider the conditional probability measures associated with £: 

exp{-V(6)}5 m _ s (dB) 



ir^de | £(0) 



exp{-V(60} 8 m - z (d&) 



where 5^_ z {d9) is a measure with support 

Y.{z) = \eeSl I £(#) = z}, 
defined by the formula: for all smooth test functions <p and ip, 



Jq 



HmMO) d0 = / 4>(z) / y{9) 5 m _ z {d0) dz. 

Q J JT,{z) 

The main assumption underlying free-energy biased methods is that the function £ is chosen 
so that the sampling of ir^(d9 \ £(6>) = z) is significantly "easier" than the sampling of tt(0) d9, 
at least for some values of z. In other words, ir-{d8 \ £(#) = z) should be much less multimodal 
than ir(9) d9, at least for some values of z, see the discussion in Section [4.21 below. 

To give a concrete example, consider the choice £(#) = q\. In this case, 7r^(d9 | = z) 
is the conditional posterior distribution of all variables except qi, conditionally on q± = z. 
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The function £ is called a reaction coordinate in the physics literature, because of its phys- 
ical interpretation: this function £ parameterizes the progress of some chemical event at a 
coarser scale (chemical reaction or change of conformation for example). Given the trajectory 
{9t}t>o of a Markov chain, £(#t) is typically a metastable trajectory, and varies on timescales 
much larger than the typical microscopic fluctuations of the system around its metastable 
configurations. Of course, in our Bayesian mixture context, this physical interpretation is 
not very useful. For the moment, we stick with the more generic (and informal) understand- 
ing mentioned at the beginning of this paragraph, i.e. the sampling of ir^(d9 | £(#) = z) 
should be "easier" than the sampling of ir(9) d6. We defer the important discussion on how 
to interpret and choose this "reaction c oordi n ate" in our specific context to Se ction 14.21 We 
also refer the readers to Lelievre et al. ( 20081 ): Lelievre and Minoukadeh ( 2011 ) for a precise 
quantification of this concept in a functional analysis framework. Finally, although we con- 
sider only one-dimensional reaction coordinates in this paper, we me ntion that extensions 
to reaction coordinates w ith values in W 71 with m > 2 are possible ( Lelievre et al. . 2O10l : 



Chipot and Lelievre . 2010l ). Some alg orithms allowing to sw itch between different reaction 
coordinates have also been developed (jPiana and Laid . 120071 ) . 

The free energy A(z) is defined as 



exp{-A(z)} 



E(*) 



ex V {-V(9)}5 m _ z (de), 



(6) 



see for instance Section 5.6 in iBalianl (120071 ). In other words, the free energy is minus the 
marginal log-density of the reaction coordinate. As above, let us again consider the simple 
example when the reaction coordinate is £(#) = q±. Then, the free-energy is simply, up to an 
additive constant, equal to 



A(qi 



where 




l(qi)xR K x(R+) K+1 



exp {—V(9)} dqi ■ ■ ■ dqx-i dpL\ . . . dfijc d\\ . . . d\x d/3 



K-l 



S K -i(qi) = I (q 2 ,...,q K -l) € (M+) X " 2 , £ ft = 1 



i=2 



In words, the free energy is in this case the opposite of the log-marginal density of q\. 
The free energy can be used to bias the target density it as follows: 



tta(9) 



1 

~Z~A 



eX p{_y(0) + (Ao£)(0)} 



We refer to densities of this form as free energy-biased densities. The essential property 
of tta(9) is that, by construction, the corresponding marginal distribution of £ is uniform 
on the interval [z m i n , z m ax]- A sampler targeting iTA(9)d9 is thus much less likely to be 
metastable (namely to get stuck around a local minimum of the density) than a sampler 
targeting -n{9)d9, because (i) the former sampler should move freely along the direction £(#) 
defined by the reaction coordinate, since the marginal distribution of £(#) is uniform, and 
(ii) we have assumed that the reaction coordinate £(#) is such the conditional probability 
distributions -K A {d9 \ £(9) = z) = ir^(d9 \ £(0) = z) are easy to sample, at least for some 
values of z (namely that they do not have very separated modes). 
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Therefore, free energy-based methods aim at sampling tta(0), in order to move freely 
across the sampling space. Then, tt is eventually recovered through an importance sampling 
step, from tt a to tt: for any test function (p, 



E 



We refer to Section 13.31 for further precisions. Note that for ([7]) to hold, A only needs to be 
defined up to an additive constant. 



/ ip(6)exp{-Ao£(8)} ir A (e)d9 E 71 ^ (tp exp {—A o £} i 

/ v (6)*(0)dB = &- r = \ f. (7) 

Jn / exp {-A o £(9)} ir A (6) dO ^ A ( exp {-A o 1 

Jn 



3.2 Computing the free energy by adaptive methods 



In most cases, the free energy A defined in ([6]) does not admit a closed-form expression, 
and must be estimated. There are nowadays many techniques to this end, with various 
degrees of efficiencies and conceptual complexities. We present in this section some powerful 
algorithms, namely adaptive biasing methods, which are not so well known in the statistical 
literature. Of course, a ny other standard m ethod such as thermodynamic integration could 
be used (see the book ( Lelievre et al. . 20 id ) for a precise presentation of standard methods 
fo r free energy computation s in the framework of computational Statistical Physics, as well 
Gelman and Meng ( 19981 ) for a review from the viewpoint of Statistics). 



as 



3.2.1 General structure of adaptive methods 

In adaptive biasing Markov chain Monte Carlo methods, a time- varying biasing potential 
A t {z) is considered. The biasing potential A t is sequentially updated in order to converge to 
the free energy A in the limit. As already mentioned in the introduction, the term "adaptive" 
refers in this paper to the dynamic adaptation of the targeted probability measure, and not 
of the parameters of a Markov kernel used in the simulations. Specifically, at iteration t, the 
time-varying targeted density is 

W*) = 4- exp {-V(9) + (A t o £)(<?)} . (8) 

An adaptive MCMC algorithm simulates a non- homogeneous Markov chain (9t), t = 1, 2, . . ., 
using the two following steps at iteration t: 

(1) a MCMC move according to the current target tta ± defined in (jHJ), 

Ot-KtiOt-!,-), 

where Kt is a Markov kernel leaving iiA t invariant; 

(2) the update of the bias to A t+ i, using a trajectory average, see Section [3.2.21 below. 

The first step may be done using a Hastings-Metropolis kernel for instance, see Figure [2j 

Before explaining the second step, let us mention how the discretization of the reaction 
coordinate values for the biasing potential At is done in practice. A simple strategy, which 
we adopt in this paper, is to use predefined bins, and approximate the biasing potential At or 
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its derivative A' t (with respect to z) by piecewise constant functions. Specifically, we consider 
N z bins of equal sizes Az, 

n z -i _ z 

[•^minj ^max] — [^J %i — -^min ~i~ iAz, Az — — . 

i=0 z 

Other discretizations may of course be used, but this is not the focus of this paper. 



3.2.2 Strategies for updating the bias 



Recall that the bottom line of adaptive methods is that At should converge to A. Adaptive bi- 
asing methods can be classified into two categories, depending on whether it is the free energy 
At(z), or its derivative A' t {z) with respect to z, which is updated. Instances of the first strat- 
egy, r oaJled_jidjnptiye^ methods, include nonequili brium metadynam 



egy, called adaptive biasing potential (Ap r) methods, include nonequili brium metadynam- 
ics (Bussi et all 2006 ; Raiteri et al. . 20061 ). the Wang-Landau algori thm ( Wang and Landau 



20011) 



a 



force (ABF) methodology (jDarve and Pohorilld . l200ll ; lHenin and Chipotl . 12004 ; iLelievre et al 



and Self-Healing Umbrella Sampling ( Marsili et al. . 20061). The adaptive biasing 



20071 ). which is the main adaptive method used in this paper, is an instance of the second. 
From now on, we focus on two particular strategies, one belonging to the ABP class, and 
another to ABF class. 

The ABP strategy we choose is based on iMarsili et all (|2006h . In particular, we do not 
use the Wang-Landau algorithm, which is, to our knowledge, t he on l y ABP metho d dis cussed 
before in the Statistical literature; see e.g. Atchade and Liu ( 2010l ). Liang ( 2005 ) and Liang 
l|20ld ). indeed, a delicate point with the Wang-L andau approach is how to choose the vanish- 
ing rate of the "gain factor"; see e.g. iLiang) (120051 ). On the other hand, Self-Healing Umbrella 
Sampling does not involve such an additional parameter to be tuned. It consists in updating 
A t as follows. The biasing potential for z G {zi,Zi + \) is initially set to exp{—Ao(z)} = l/N z 
(for all i G {0, . . . , N z — 1}), and then updated for all i G {0, . . . , N z — 1} and for t > 1 as 



t-i 



Mz G (zi,z i+1 ), exp{-A t (z)} 



1 + Yl h^mx^+i} ex P [ 



Ajo 



(9) 



the normalization factor Z% being such that 



N z -i 
Az £ 

i=0 



cxp 



-At 



Z{ + Zi+l 



1. 



The method may be understood as follows: If 9j was indeed distributed according to tt^ at 
all times j, then the weight exp [ — Aj(£(9j))\ , proportional to TT(9j)/TTA j {9j), would correct 
for the bias introduced at iteration j, and exp{— At(z)} would indeed be an estimator of the 
probability that £(#) G (2^,24+1) when 6 is distributed according to the unbiased density 7r, 
that is exp{— A(z)}. Of course, since At is varying in time, it is not exactly true that 9j is 
distributed according to 717^, but this reasoning yields at least an intuition on the way the 
method is built. It is easy to check that, provided the method converges, the only possible 
limit for the biasing potential At is the free energy A (up to the discretization error introduced 
by the binning of the reaction coordinate values). Generaliza tions of the update e quation ([9]) 
leading to higher efficiencies have recently been proposed in dPickson et all l20ich . 
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Alternativ e ly, th e ABF strategy (jDarve and Pohorilld . l200ll ; iHenin and Chipotl . 12004 ; 



Lelievre et al.l . 120071 ) is based on the following formula for the derivative of A (called the 



A'{z) = F{z)=W(f{9) m 



mean force): 



where / admits an analytic expression in terms of £ and V: 



/ 



|V£| 



div 



|V£| 



(10) 



(11) 



wher e V is the gradient operator, and div is the divergence operator. As shown in (jLelievre et all 
20101 ). formul as (fT0l-(fTTD may be derived from the definiti on (151 ) of the free energy using the co- 



area formula ( Evans and Gariepy . 19921 : Ambrosio et al 
it is easy to prove that (fT0l) - (fTT]) hold, and that / = dV/dq\ 



. In the simple case £(0) = q\, 
As mentioned above, the con- 
ditional measure of it with respect to £(0) = z is the same as the conditional measure of irA t 
with respect to £(9) = z. Thus, a natural ABF updating strategy is to compute at iteration 
t the following approximation of the mean force: for all i € {0, . . . , N z — 1} and for t > 1, 

t-l 



Vz € (zi,z i+ i), F t (z) 



L {zi<^0j)<z i+ i} 

J'=l 



t-1 



(12) 



{«i<e(^-)<*i+i} 



From this approximation of -F, an approximation of the free energy A can be recovered 
by integrating Ft(z) in z. The consistency of the method may be understood as follows: If 9j 
was distributed according to ir^. at all times j, then we would have F t = F and hence A t = A 
(up to an additive constant). Besides, as above, it can be shown that, provided the method 
converges, the biasing potential At converges to (a discretized version of) the free energy A, 
up to an additive constant. 

The interest of ABP compared to ABF is that it does not require computing / given 
by (jlip . which may be cumbersome for some £ such as £ = V (minus the log posterior 
density). On the other hand, it is observed that the ABF method yields very good results 
since the derivative of the free energy is approximated, so that after integration, the adaptive 
biasing potential is smoother in z for ABF than for ABP. In the following, we use the ABF 
method, except when we consider as a reaction coordinate the potential V. In this case, the 
ABP method is used. 

The convergence of the adaptive bias ing force method (for a slightly different dynamics), 
has been studied in ( Lelievre et al. . 20081 ). and its asso ciated discretization u sing many replica 



of the simulated Markov chain has been considered in iJourdain et al.l (l2010h. For re finements 



concerning the implementation of such a strategy, we refer to iLelievre et al.l (|2007l ) 



3.2.3 Practical implementation of adaptive algorithms 

To summarize the method, we give in Figure [2] the details of the ABF algorithm. A similar 
algorithm is used in the ABP case. In practice, we stop the algorithm when the bias At is 
no longer significantly modified from t = n to t = n + N cvg , where is a fixed number 
of iterations between two convergence checks. See Section 15.1.11 for an illustration of this 
strategy. 
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The so-obtained bias is then considered as a good approximation of the free energy, and 
used in the subsequent importance sampling step, as explained in Section 13.31 A perfect 
convergence is not required, in the sense that the bias does not need to be estimated very 
accurately, as it is removed in the final importance sampling step. Moreover, note that the 
biasing potential A t needs to be computed only up to an additive constant which does not 
play any role in the overall procedure. 

Figure 2: The Markov chain Monte-Carlo adaptive biasing force algorithm. 

Algorithm 1. Consider a reaction coordinate £. Starting from some initial configura- 
tion Oq and the biasing potential Aq = 0, iterate on t > 1: 

(a) Propose a move from 9 t -\ to 9' t according to the transition kernel ^(6t-i,9' t ) d9[; 

(b) Compute the acceptance rate 



(c) Draw a random variable Ut uniformly distributed in [0, 1] (Ut ~ W[0, 1]); 

(i) if Ut < at, accept the move and set 9t = 9' t ; 

(ii) if Ut > at, reject the move and set 9t = Ot-i- 

(d) Following (|12j) . update the biasing force, hence the biasing potential At+i. 

(e) Go to Step (a). 



3.3 Reweighting free-energy biased simulations 

Upon stabilization of the adaptive algorithm at iteration T, an estimate A = At of the biasing 
potential A is obtained, from which one defines the biased density 



To sample the true posterior ir, we use the following simple strategy. We simulate a standard 
MCMC algorithm, e.g. a random walk Hastings-Metropolis algorithm, targeted at the biased 
posterior density tt, and then perform an importance sampling step from tt to w, based on 
the importance sampling weights: 




where the biased probability density tt^ is defined as 



7r At (0)oc7r(0)exp [Ato£(0)]; 




(13) 




(14) 
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Prom the MCMC chain (6t)t>i targeted at tt, the expectation with respect to tt of a test 
function h can thus be estimated as (see ([7])): 



■max 



E"(hw) 



E WtMOt) 



W(h) 



t=l 



(15) 



E <°t) 



t=i 



where t max is the number of iterations of the MCMC chain. 
3.4 Reaction coordinates with unbounded values 

There are many cases when the reaction coordinate takes values in an unbounded interval JP . 
Here should be understood as the support of the distribution of the random variable £(#) 
when 6 ~ ir(9)d9. One may think of = fi\ as an example for the mixture posterior 
distribution case (in which case J" = M), and see Section [4.21 for more examples. 

It is not possible to apply the above procedure on the whole interval J^. Some truncation 
is required for at least two reasons. First, numerically, it would be difficult to discretize in 
space a function defined on an unbounded domain. Second (and more importantly) the use 
of the full free energy over J? would lead to a density tta which is not integrable (since the 
uniform law over .J? is not well defined as a probability distribution). 

We therefore resort to the following strategy. First, we choose some truncation interval 
[z mm , Zmxr] ■ Then, in the adaptive MCMC algorithm (which calculates the free energy), we 
reject any point 9 such that £(#) fall outside of this interval. This is tantamount to restricting 
the sampling space with the constraint z m i n < < z max . In this way, one obtains an 
estimate A(z) of the free energy A(z), but only for z € [z m \ n , z ma _ x }. 

When A is obtained, we simply extend its definition outside z 6 [z m i n , z max ] as follows: 
A(z) = A(z m m), for z < z m in, A{z) = i(z max ), for z > z max . Finally, we run a standard 
MCMC sampler targeting the distribution tt = ir^, as described in the previous section. Note 
that the biased distribution tt is defined over the whole parameter space f2 (in particular, no 
additional rejection step is needed in the sampling of this distribution). 

In practice, one should choose an interval [z m i n , z max ] which is not too large, but at the 
same time such that the probability (with respect to tt) of the event z m j n < £(0) < z max is 
close to one: 



so that J? \ [zmi n , z max ] is barely visited (see also Section [4. 1.2[) . This is one of the practical 
difficulties that we shall discuss in the next section. 

4 Bayesian inference from free energy biased dynamics 

In this section, we explain how to perform Bayesian inference for the univariate Gaussian 
mixture model described Section 12.11 that is, how to compute quantities such as posterior 
expectations and ratios of marginal likelihoods (equal to ratios of normalizing constants Zk 




exp(— A(z)) dz 



1, 



(16) 




exp(— A(z)) dz 
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defined in @ ) , using the free energy associated to a given reaction coordinate to build an im- 
portance function. The Gaussian mixture model corresponds, in the notation of Section [3l to 
ir(9) = p(9\y) oc p(0)p(y\6), hence V(8) = — log{p(9)p(y\8)} . For a given reaction coordinate, 
and a given estimate A of the free energy A, the free-energy biased probability distribution is 

p{9\y) (xp(9\y)/w(6) otp(9\y)exp (,4o£ 

where w is defined by (fT4"|) , 

We start by listing the criteria we use to assess the quality of the importance sampling 
procedure in Section 14. 11 As mentioned in the introduction, the strategy to sample the 
posterior distribution ([2]) consists of three steps: choosing a reaction coordinate, computing 
(an approximation of) the free energy associated to this reaction coordinate, and using the 
free energy to build an importance sampling proposal distribution according to (|13p . The 
previous section was devoted to the second and third steps. We discuss the first step in 
Section 14.21 for the mixture model at hand. Section 14.31 presents an extension of the method 
to the computation of the ratio of normalizing constants associated to different values of the 
number of components K, in order to perform model choices between models corresponding 
to different number of components. Note that we discuss in this section the theoretical 
efficiency of the whole approach. These discussions are supported by numerical experiments 
in Section [5j 

4.1 Criteria for choosing the reaction coordinate 
4.1.1 General criteria 

We consider the following criteria for evaluating the practical efficiency of the whole procedure, 
for a given choice of the reaction coordinate £: 

(a) In the execution of the (either ABF or ABP) adaptive algorithm, how fast does the 
approximate free energy At converge to its limit A? 

(b) How efficient is the importance sampling step from the biased distribution to the originally 
targeted posterior distribution? Actually, this criterion is twofold: 

(bl) How efficient is the MCMC sampling of the biased density p(9\y)7 

(b2) How representative are the points simulated from the biased distribution with re- 
spect to the target posterior distribution? (i.e. how many of these points are 
assigned non- negligible importance weights?) 

(c) A more practical criterion is (in the case of a reaction coordinate with values in an un- 
bounded domain): How difficult is it to determine, a priori, an interval [z m - m , z max ] for the 
reaction coordinate values, which ensures good performance with respect to Criteria (a) 
and (b) and which satisfies (fT6j) ? 



Criterion (b2) is discussed in the next section. Criteria (a) and (bl) can actually be show n 
to be closely related, at least for some family of adaptive methods, see Lelievre et al. ( 20081 ) . 



Roughly speaking, an adaptive algorithm yields quickly an estimate of the free energy, if and 
only if the free energy is indeed a good biasing potential, in the sense that the dynamics 
driven by the biased potential converges quickly to a limiting distribution. Theoretically and 
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as mentioned above, a sufficient condition for an efficient sampling is that the conditional 
probability distributions ir^(dx | = z) are easy to sample, at least for some values 



of z (namely they do not have very separated modes). We refer to iLelievre et al. 
Lelievre and Minoukadehl (|201ll ) for precise mathematical results. 



(120081 ): 



Numerically, to assess the convergence of adaptive methods, we recommend the following 
two basic checks: (i) that the output of the adaptive algorithm has explored the full range 
[zm\m Zmax] and has a distribution which is close to uniform; and (ii) using the criterion 
mentioned in the introduction, and specifically for mixture posterior distributions, that the 
algorithm has visited the K\ symmetric replicates of any significant local mode. The same 
convergence checks can be applied to the MCMC dynamics targeted at n. 



4.1.2 Efficiency of the importance sampling step 

We give here a way to quantify Criterion (b2). To evaluate the performance of the importance 
sampling step, we compute the following efficiency factor 



EF 



t=i 

where w{9) is defined in (|14p . and where {0t}t>o denotes the MCMC sample targeting the 
biased posterio r p(Q\y), as descri bed in Section 13.31 The efficiency factor is the Effective 
Sample Size of iKong et al.l (]1994h divided by the number of sampled values. This quantity 
lies in [0, 1]. It is close to one (resp. to zero) when the random variable w(8) has a small 
(resp. a large) variance. Indeed, it is easy to check that 



EF 



V(EtH) 2 



+ 1 



Varr(w) 



t=i 



where the latter quantity is the empirical variance of the sample {w(9t)}i<t<T, and Er(w) = 
Ylt=i w(6 t )/T its empirical average. 

We now propose an estimate of the efficiency factor in terms of the converged bias A 
only, which may therefore be computed before the MCMC algorithm targeting the biased 
posterior is run, and the importance sampling step is performed. This estimate is based on 
the fact that, with respect to p(0\y), the marginal distribution of £ is approximately uniform 
over [z m i n , z max ] . For well chosen z m \ a and z max , £(#t) hardly visits values out of the interval 
\z m\n , Zmx*\ (see (fTB]) above) and thus 



Varr(u;) 
(EtM) 5 



1 



'max ^mm t/2 m ; 



exp 



{-A(z)} 



1 



exp < —A > dz 



^max ^min J z m j n 



exp 



{-A(z)} 



dz 
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which provides a justification for the following "theoretical" efficiency factor: 

exp |— j4(z)| dz\ 

FFtheoretical = /-Zmax ~, ~ ^ 1 (^U 

(^max - ^min) / exp |-2^(z)| (fe 

" ^min 

The agreement between theoretical and numerically computed efficiency factors in our simula- 
tions is very good, see Tables [TJ [2] and d] in Section[57L3j Thus, the theoretical efficiency factor 
allows for a quick check that the subsequent importance sampling is reasonably efficient. 

From the expression (|17p . it is seen that the efficiency factor is close to one when A is 
close to a constant. Thus, Criterion (b2) mentioned in the previous section is likely to be 
satisfied if the free energy associated to £ has a small amplitude, i.e. max A — mm A is as 
small as possible. 

4.2 Practical choice of the reaction coordinate 

We now discuss the practical choice of the reaction coordinate £ : 8 — > M. in the mixture 
posterior sampling context, with respect to the criteria listed above. We discuss successively 
the following four possible choices: £(0) = £(0) = V(8) = — log{p(9)p(y\6)} , £(0) = q\ 
and £(0) = (3. This discussion is also illustrated numerically in Section [5.1.21 

The requirement that the multimodality of the target measure conditional on £(0) = z 
is much less noticeable than the multimodality of the original target measure rules out the 
choice of \i\ as a good reaction coordinate since, conditionally on fj,±, the posterior density 
still has at least {K — 1)! modes, as the components 2 to K remain exchangeable. Numerical 
tests indeed support these considerations, see below. 

A more natural reaction coordinate is minus the pos terior log-density £(0) = — logjp(0)p(?/|0)|, 
in the spirit of the original Wang-Landau algorithm ( Wang and Landaul . 200 lbl 3). Indeed, 



exploring regions of low posterior density should help to escape more easily from local modes. 
Unfortunately, determining a range \z m \ a , z max ] of "likely values" (with respect to the posterior 
distribution) for such functions of 8 is not straightforward; see Criterion (c) above. Moreover, 
since the posterior density is expected to be multimodal and difficult to explore, there seems 
to be little point in performing MCMC pilot runs in order to determine [ zmm , z max ]. A conser- 
vative approach is to choose a very wide interval [^mi n , 2max]j but this makes the subsequent 
importance sampling step quite inefficient. In our simulations, we report satisfactory results 
for this reaction coordinate, but with the caveat that our choice for [z m [ n , z max ] was facili- 
tated by our different simulation exercises, based on several reaction coordinates. Another 
practical difficulty we observed in one case is that the estimated bias is quite inaccurate in 
the immediate vicinity of the posterior mode, because the free energy tends to increase very 
sharply in this region, see Section 15.21 for more details. 

The choice = q\ is satisfactory with respect to Criterion (c): The range on which 
it can vary, namely [0, 1], is clearly known. With respect to (a), this choice looks appealing 
as well, since forcing q\ to get close to 1 should empty the K — 1 other components, which 
then may swap more easily. Unfortunately, we observe in some of our experiments that 
the dynamics biased by the free-energy associated with this reaction coordinate is not very 
successful in terms of mode switchings, see Figures [6] and IT3l 

Finally, £(0) = f3 appears to be a good trade-off with respect to our criteria, at least in 
the examples wg consider below. Concerning the determination of the interval [-^mini ^max] 
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(Criterion (c)), since (3 determines the order of magnitude of the component variances a\ = 
A^ 1 , there should be high posterior probability that (3 is a small fraction of R 2 , where R is 
the range of the data. For instance, we obtain satisfactory results in all our experiments with 
[^min, Zmax] = [-R 2 /2000, i? 2 /20] . Concerning Criterion (a), we observe that the choice £ = f3 
performs well (see the numerical results below). 

We propose the following explanation. Since the A^'s have a Gamma(a,/3) prior, large 
values of f3 penalize large values for the component precisions A&, or equivalently penalize 

— 1/2 

small values for the component standard deviations = \ k . If j3 is large enough, the 
Gaussian components are forced to cover the complete range of the data, and thus can switch 
easily. This interesting phenomenon is illustrated by Figure [81 see below for further precisions. 
In other words, a "good" reaction coordinate £ should be such that the conditional probability 
distributions 7r^(dx | = z) are less multimodal than tt, at least for some values of z. For a 
theoretical result supporting this interpretation, we refer to Lelievre and Minoukadeh ( 201ll ). 



4.3 Computing normalizing constants and model choice 

In this section, we discuss an extension of the method to perform model choice between models 
with different numbers of components. The principle is to compute the normalizing constant 
Zk of the posterior density for different values of K, see ([!]) for a de finition of Zk- More 



preci sely, it is sufficient to evaluate Zk/Zk-i for a given range of K (see (jRobert and Casella 



20041 . Chap. 7) for a review on Bayesian model choice). 

We propose the following strategy. The estimation of Zk/Zk-i can be performed by first 
estimating Zk/ Z, then estimating Zk-i/ 'Z, and finally dividing the two quantities. A simple 
estimator of Zk/Z (where Z is the normalizing constant in (I13j) ) is given by 

1 T 

j* = -5>(0 t ), 

i=l 

where {9 t }t>o is a sample distributed according to the biased probability p(9\y) (with K 
normal components). This formula is based on the fact that the expectation of w{9) = 
exp{— A o £(#)} with respect to p(0\y) is Z^/Z. 

Let 9_k denote the parameter vector obtained by removing in 9 the parameters attached 
to a given component k, and replacing the probabilities q\ (for I ^ k) by q\ = qi/(l — qk)- 
Let p(y\9-k) denote the likelihood of the model with K — 1 components, and parameter 9_^. 
Then the following quantity 

K T 

Ik-i = -g XT 1} <-i,k, Iic-i,k = j, w_ k {9 t ), 
k=i t=i 

where {9t}t>o is the same Markov chain as above, and 

is an estimator of Zk—i/Z. 

The estimators Ik and Ik^ are remin i scent of the birth and death moves of the reversible 
jump algorithm of Richardson and Green ( 19971 ). where a new model is proposed by adding 



or removing a component chosen at random. The difference is that the biased posterior 
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p(9\y) acts as an intermediate state between the posterior with K components, p{9\y) and 
the posterior with K — 1 components (or more precisely, the posterior with K — 1 components 
times the prior of a .ff-th "non-acting" component, in order to match the dimensionality of 
both p(9\y) and p(9\y)). 

In our numerical experiments, the estimator of Zr/Zk-i obtained from this strategy 
performs well, see Section [5] (in particular Table [3|). 



5 Numerical examples 

In our experiments and as explained above, we use the following approach. First, we run an 
adaptive algorithm (ABF, except for £(0) = V(9) = — log{p(9)p(y\9)} , in which case we use 
ABP), for a given choice of the reaction coordinate £, and a given interval [z m i n , z m ax], until a 
converged bias A is obtained. Second, we run a MCMC algorithm, with p(9\y) given by (|13p 
as invariant density. Third, we perform an importance sampling step from p(9\y) to p(9\y), 
the unbiased posterior density. See the introduction of Section U] for the notation. 

The quality of the biasing procedure is assessed using the criteria mentioned in Sec- 
tions 14.11 This consists in: (i) checking that the reaction coordinate values are uniformly 
sampled over [z m ; n , z max ], (ii) checking that the output is symmetric with respect to labellings, 
and many switchings between the modes are observed and (iii) computing the efficiency factor 
(a good indicator being the estimator (117j) defined in terms of A). 

In the first step of the method (approximation of the free energy using adaptive algo- 
rithm), we deliberately use the simplest exploration strategy, namely a Gaussian random 
walk Hastings- Metropolis update, with small scales (see below for the precise values). This is 
to illustrate that the ability of adaptive algorithms to approximate the free energy does not 
crucially depend on a finely tuned updating strategy. 

In the second step, we run a Hastings-Metropolis algorithm targeted at the biased poste- 
rior, using Cauchy random walk proposals, and the following scales: = ii/1000, t v = 2/R 2 , 
rg = 2 x 10 -5 ait! 2 , where R is the range of the data, which leads to acceptance rates between 
10% and 30% in all cases. 



5.1 A first example : the Fishery data 

We fir st consider the Fishery data of Titterington et al. ( 19861 ) (see also Friihwirth-Schnatter 
(j2006l )). which consist of the lengths of 256 snappers, and a Gaussian mixture model with 
K = 3 components; see Figure Q] for a histogram. 



5.1.1 Convergence of the adaptive algorithms 

In the adaptive algorithm, we use Gaussian random walk proposals with scales r q = 5 x 10 -4 , 
Tfj_ = 0.025, t v = 0.05 and rg = 5 x 10~ 3 . These parameters were also used to produce 
the unbiased trajectory in Figure [U We illustrate here the convergence process in the case 
£(#) = (3, using the ABF algorithm described in Section [33| with z m [ n = 0.05, z max = 4.0 
and Az = 0.01. 

First, we plot on Figure [3] the trajectory of (fi\, /12, ^3) and /? for T = 10 s iterations. 
With the ABF algorithm, the values visited by /3 COV6f ttie whole interVcll [^min; ^max 

1, and 

the applied bias enables a frequent switching of the modes (observed here on the parameters 
(pi, /12, ^3))- The trajectories for (^1,^2^3) should be compared with the ones given on 
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Figure [H where no adaptive biasing force is applied (note that the x-axis scale is not the 
same on both plots). 




Figure 3: Trajectories over the 10 8 first iterations of the ABF algorithm for the choice = 
f3, for (fn, fj,2, [Js) (left) and for the f3 variable (right). 

Second, we monitor the convergence of the biasing potential. To this end, we run a 
simulation for a total number of iterations T = 10 9 , and store the biasing potential every 
iV cvg = 10 6 iterations. The distance between the current bias and the bias at iteration 
t — N cvg is measured by 



N z -l 

(At,-A t . Ncve ,-c) 2 , (19) 

8=0 

where At^ denotes the value of the bias in bin i, i.e. At(z) = At t i if z 6 (zi,Zi+x)- Since the 
potential is defined only up to an additive constant, we consider the optimal shift constant 
c which minimizes the mean-squared distance between the two profiles. An elementary com- 
putation shows that this constant is equal to the difference between the averages of At and 
At~N cvg - We finally renormalize this distance as 

Sj 

et = , =■ 
vEi=o Af^ 

The relative distance et as a function of the iteration index t is plotted in Figure [H Correct 
approximations of the bias are obtained after a few multiples of N cvg iterations (the relative 
error being already lower than 0.1 at the first convergence check). We emphasize again that 
we did not optimize the proposal moves in order to reach the fastest convergence of the bias. 
It is very likely that better convergence results could be obtained by carefully tuning the 
parameters of the proposal function, or resorting to proposals of a different type. 

On Figure[5l we plot the free energies associated to the four reaction coordinates mentioned 
above, as estimated by adaptive algorithms. We recall that, for £(9) = /3, z m { n = 0.05, 
Zmax = 4.0 and Az = 0.01. For £(0) = — log{p(9)p(y\9)} , we used an ABP algorithm, with 
£min = 500, z max = 540 and Az = 0.1. For £(0) = q\ and £(#) = fix, we used ABF, with 
respectively z min = 0, z max = 1, Az = 0.005 and z min = miny; = 2.5, z max = maxyj = 13, 
Az = 0.05. Recall that the so-obtained bias is minus the marginal posterior log-density of £. 
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O.Oe+00 1.0e+08 2.0e+08 3.0e+08 4.0e+08 5.0e+08 

Number of iterations 

Figure 4: Convergence of the logarithmic relative distance log(ej)/log(10) (see (fT9|) ). as a 
function of the number of iterations. 

This is why the three important modes in the \x parameter can be read from the corresponding 
bias in Figure Note also that there is a lower bound on the admissible values of minus the 
log-posterior density, hence the plateau value of the corresponding bias for low values of the 
reaction coordinate corresponding to unexplored regions. 




0.0 1.0 2.0 3.0 4.0 500 510 590 530 540 

Beta Negative of the log-posterior density 



Figure 5: The fishery data. Free energies obtained for the reaction coordinates: = \x\ 
(top left), i(Q) = q x (top right), £{0) = (3 (bottom left) and f(0) = - log{p(9)p(y\0)} (bottom 
right). 

5.1.2 Efficiency of the biasing procedure 

We now discuss the results of the MCMC algorithm targeted at the biased posterior distri- 
bution, using the free energies computed above. In Figure [6J we observe that all the biased 
dynamics are much more successful in terms of mode switchings than the unbiased dynam- 
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Reaction coordinate 




-\og{p{9)p(y\9)} 


Qi 


Mi 


EF (numerical) 


0.17 


0.16 


0.48 


0.04 


EF (theoretical) 


0.179 


0.178 


0.454 


0.079 



Table 1: Efficiency factor for various choices of reaction coordinates, in the case K = 3. 

ics (see Figure H]) . More precisely, the dynamics biased by the free energy associated with 
£(#) = — ^og{p(9)p(y\9)} is the most successful in terms of switchings, but the dynamics with 
£(0) = j3 performs correctly as well. The dynamics with £(0) = q\ seems to be less successful. 
In the case £(6) = (Mi, one value of the parameters \i is forced to visit the whole range of 
values. The lowest mode (around Li = 3) is not very well visited here. 

The efficiency factors are presented in Table [TJ They are rather large, which shows that 
the importance sampling procedure does not yield a degenerate sample. The choice £(#) = q\ 
is the best one, but £(6>) = j3 and £(#) = — log{p(9)p(y\9)} give comparable and satisfactory 
results as well. The choice £(9) = fii on the other hand is a poor choice in this case. 

In view of these results, it seems that £(0) = — log{p(9)p(y\9)} is the best choice, with the 
problem however that we had to slightly modify the bias for the lowest values of the reaction 
coordinate because of the too sharp variations of the bias in this region. (Our numerical 
experience is indeed that the bias obtained from minus the log-posterior density is sometimes 
difficult to use directly.) On the other hand, the procedure is more automatic for £(#) = q\ 
and (3, the latter reaction coordinate being a much better choice when it comes to mode 
switchings. 

We now focus on £(9) = (3. As explained in the introduction, a good sampler should 
visit all the possible labellings of the parameter space. This implies in particular that the 
marginal posterior distributions of the simulated component parameters should be nearly 
identical. This is clearly the case here, see the scatter plots of the l-in-10 4 sub-sample of 
the simulated pairs (//j,logAj), i = 1, ... ,3 in Figure [TJ The top left picture in Figure [7] also 
demonstrates that the biased dynamics indeed samples uniformly the values of (3 over the 
chosen interval [z m i n , z max \. 

Finally, Figure [8] illustrates why the reaction coordinate £(9) = f3 allows for escaping from 
local modes; see the discussion in Section \4. 21 Each plot represents a sub-sample of the simu- 
lated pairs (fJ.k,t, log ^k,t) (where the subscript t denotes the iteration index while the subscript 
k labels the components), restricted to fit being in intervals, from left to right, [0, 0.5], [1.5, 2] 
and [3.5,4]. Since the bias function depends only on (3, these plots are rough approximations 
of the posterior distribution conditional on (3 = 0.25 (its posterior expectation), (3 = 1.75 and 
(3 = 2>. In the leftmost plot of Figure El (3 is fixed to its posterior expectation and the three 
modes are well separated. As (3 is forced to take artificially large values (in the sense that 
the posterior probability density of such values is very small) , the three modes get closer and 
eventually merge. 

5.1.3 Larger values of K and model choice 

We apply our approach to other values of K, namely K = 4 to 6, in the case £(#) = (3. Table[2] 
reports the efficiency factor as a function of K. These factors remain quite satisfactory, which 
is related to the fact that the amplitude of the free energy (difference between the maximum 
and the minimum values) associated to this reaction coordinate is not too large over the 
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Figure 6: The fishery data. Trajectories of (/ii, //2, ^3) for the biased dynamics for different 
reaction coordinates. Top left: £(0) = fi\. Top right: £(0) = q\. Bottom left: £(0) = (3. 
Bottom right: £(0) = - log{p(8)p(y\6)} . 



K 


3 


4 


5 


6 


EF (numerical) 


0.17 


0.18 


0.17 


0.16 


EF (theoretical) 


0.179 


0.195 


0.180 


0.171 



Table 2: Efficiency factors, for various values of the number of components considered in the 
mixture, with the choice £(0) = f3. 

chosen interval [z m i n ,Zmax]j and does not change dramatically, see the profiles obtained for 
different values of K in Figure [H 

Figure flOl represents the marginal posterior distribution of (/ii,logAi), for K = 4,5,6. 
These plots are obtained by resampling 2000 points from the output of the MCMC targeting 
the biased posterior, with probability proportional to the importance sampling weight defined 
in (I14|) . In each case, we checked that the MCMC output is symmetric with respect to label 
permutations. 

Table [3] reports, for K = 3, . . . ,6, the estimated log-Bayes factor for choosing a mixture 
model with K components against a mixture model with K — 1 components, which equals 
log Zk/Zk-\i assuming equal prior probability for different values of K. The reported error 
levels in Table [3] correspond to 90% confidence intervals, which are deduced from repeated 
independent runs of T = 10 7 iterations from the same MCMC algorithm targeting the biased 
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Figure 7: Top left: Histogram of simulated /3's and estimated marginal posterior density of 
f3. Remaining pictures: Scatter plots of simulated values for (/Uj,logAj), for i = 1,2,3 when 
(3 is used as a reaction coordinate. 



K 


log Zk/Zk^ 


error 


3 


7.1 


± 0.2 


4 


4.2 


± 0.1 


5 


1.5 


± 0.1 


6 


0.9 


± 0.1 



Table 3: Log Bayes factors for comparing model with K components against K — 1 compo- 
nents, for K = 3, 6; estimation error as evaluated from independent MCMC runs. 

posterior. The estimation error is quite small, despite being based on importance sampling 
steps in high dimensional spaces. 

5.2 A second example: the Hidalgo stamps data 

An other well-known benchmark for mixtures is the Hidalgo stam ps dataset, first studied 
by llzenman and Sommer (| 19881 ) (see also e.g. basford et ail (jl997f n. which consists of the 



thickness (in mm) of n = 485 stamps from a given Mexican stamp issue; see Figure Q] for a 
histogram. (For convenience we multiplied the observations by 100.) We focus our presen- 
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3 2 4 6 8 10 12 ^2 4 6 8 10 12 3 2 4 6 8 10 12 

R /*i /*i 

Figure 8: Simulated pairs (fi\, log Ai) conditional on, from left to right, /3 S [0, 0.5], /3 € [1.5, 2] 
and /3 G [3.5,4], see Section I5.1.3I for more details. 
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Figure 9: Estimated bias (free energy), for K = 3, . . . , 6, and £(0) = /3. 

tation on the challenging case K = 3. For other values of K between 4 and 7 our approach 
performs better than for K = 3. For the sake of space the corresponding results are not 
reported. 

This example is more challenging than the previous one, presumably because the number of 
observations is larger, which makes the likelihood more peaked. A clear sign of the increasing 
metastability is the increase in the free-energy barriers. For the reaction coordinates = q\, 
£(0) = /3, £(0) = — log{p(0)p(y\0)} , we had to run the adaptive algorithm for T = 10 9 
iterations in order to obtain a converged bias and recover a biased posterior sample which is 
symmetric by labelling. Again, a more elaborate proposal strategy for the Hastings-Metropolis 
step in the adaptive algorithm would be likely to stabilize the bias faster. As an illustration, 
Figure [TT1 presents the trajectories of (^1,^2,^3) sampled by the adaptive algorithm, with 
random walk scales: r q = 0.001, = 0.05, t v = 0.1 and rp = 0.005. The trajectories should 
be compared to the ones depicted in Figure [H which are obtained with the same proposal, 
but without any biasing procedure. 

In Figure [T2l we represent the biases obtained with various choices of the reaction coor- 
dinate. In the case £(0) = (3, we set z m \ n = 0.005, z max = 2.5 and Az = 0.005. For £(#) = qi, 
we consider z m ; n = 0, z max = 1 and Az = 0.005. Finally, for £(0) = — log{p(0)p(y\0)} , we 
choose z m i n = 720, z max = 780 and Az = 0.1. 



23 



2 4 6 8 10 12 2 4 6 8 10 12 2 4 6 8 10 12 

ft ft ft 

Figure 10: Marginal posterior distribution of (//i,logAi), from left to right, for K = 4, 5 and 
6, as represented by 2000 points resampled from the MCMC output. 




2.5e+08 5.0e+08 7.5e+08 1. Oe+09 2.5e+08 5.0e+08 7.5e+08 1.06+09 

Number of iterations Number of iterations 

Figure 11: Sampled trajectories for (//i, fi2, ^3) during the adaptive biasing procedure. Left: 
ABF trajectory when the reaction coordinate is f3. Right: ABP trajectory when the reaction 
coordinate is minus the log-posterior density. 



Reaction coordinate 


P 


-\og{ P {d)p{y\e)} 




EF (numerical) 


0.02 


0.24 


0.23 


EF (theoretical) 


0.06 


0.13 


0.18 



Table 4: Efficiency factor for various choices of reaction coordinates, in the case K = 3. 

Biased trajectories are presented in Figure [13] for £(0) = qi, £(0) = (3, and £(#) = 
— log{p(6)p(y\8)}. Efficiency factors are reported in Tabled! The results show that, in 
terms of mode switching, £(6>) = — log{p(9)p(y\6)} is the best choice. The choice £(6>) = q±, 
although it leads to the highest efficiency factor, is a poor choice since very few switchings 
are observed; in particular, the mode starting around 7.5 does not change during the first 
2.5 x 10 7 iterations. Such transitions are observed in the case £(#) = j3. 
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Negative of the log-posterior density 



Figure 12: Hidalgo stamps problem. Free energies obtained for the reaction coordinates: 
£(0) = qi (top left), f(0) = p (top right) and £(0) = - log{p(0)p(y|0)} (bottom). 
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Figure 13: Hidalgo stamps problem. Trajectories of (^1,^2,^3) of the biased dynamics. Top 
left: reaction coordinate £(0) = q±. Top right: £(0) = (3. Bottom: £(0) = — log{p(8)p(y\9)} . 
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6 Conclusion 



We showed in this paper how to sample efficiently posteriors of univariate Gaussian mixture 
models, using a free energy biasing approach, which can be summarized as follows: 

(1) We choose a reaction coordinate (a function £ of 6). 

(2) We run an adaptive MCMC sampler, in order to compute an estimation of the free 
energy A associated to £. 

(3) From the estimated free energy A (the output of the previous step), we define the biased 
density tt = n^. We run a standard MCMC sampler that targets tt (say a Gaussian 
random- walk Hastings-Metropolis algorithm). 

(4) We do an importance sampling step, from tt to tt to remove the bias and recover the 
true posterior tt. 

In the particular case of univariate Gaussian mixture models, a good choice for the reaction 
coordinate is £ = j3 or = —log{p(9)p(y\0)}. When j3 is chosen, it is easy to estimate 
an interval of typical values for f3 as [c\R? , C2R 2 ] (with R the range of the data, and c\ < C2 
small constants, say c\ = 1/2000 and C2 = 1/20), while the determination of such an interval 
for — log{p(6)p(y\8)} does not seem to be straightforward. 

We think that the same ideas may be applied to other mixture models. For instance, 
Figure [14] plots the posterior density of a two-component Poisson mixture model, conditional 
on different values for the hyper-parameters. Specifically, p(yi\9) = giPoisson^; Ai) + (1 — 
qi) Poisson (y^; A2) for i = 1, . . . , n, where Poisson(-; A) denotes the probability density function 
of a Poisson distribution of parameter A. We use a Gamma(/3y, j3) prior for the Afc's, and a 
uniform prior for q\ . The n = 100 observations are simulated from this model with parameters 
(qi, Ai, A2) = (0.7, 3, 10). It can be seen again that biasing the posterior distribution towards 
larger values of /3 makes it possible to reduce the distance between the different modes. We 
also obtained interesting preliminary results for multivariate Gaussian mixtures. 




Figure 14: Scatter plots of 1000 simulated pairs (Ai,A2) from the posterior distribution 
of a two-component Poisson mixture model, and n = 100 simulated data points, with a 
Gamma(/3y, (3) prior for the A&, and, from left to right, (3 = 1, 10, 20. 

We would like to highlight some practical advantages of our approach. First, it requires lit- 
tle tuning: The main tuning parameters are the scales of the random walks in both algorithms 
(adaptive, and MCMC), and we obtained satisfactory results without trying to optimize these 
scales. Second, it is easy to check that the final results are correct: If the free energy has 
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been well estimated, and the MCMC algorithm for the biased posterior has converged, then a 
nearly uniform marginal distribution for the reaction coordinate is observed, and the marginal 
distributions for the q^) are nearly identical, because of the symmetry of the true pos- 

terior, and the numerous mode switchings in the MCMC trajectories. 

Finally, one natural question is how to extend such an approach to other classes of Bayesian 
models. As we have made clear already, the choice of the reaction coordinate is the crucial 
point. If the reaction coordinate is poorly chosen, a free energy biasing approach will bring 
no benefit. As for applications in computational Statistical Physics, there is no general 
recipe for choosing this reaction coordinate. Nonetheless, the following simple remarks may 
be considered as some guidelines. First, it seems worth investigating alternatives to the 
reaction coordinate usually chosen in Statistics, namely the negative of the posterior log- 
density. Second, in doing so, one may keep in mind the interpretation we proposed for = (3 
in Section 14.21 *■ e - a particular parameter that fixes to some extent the size of the energy 
barriers between the different modes. In particular, in a given Bayesian hierarchical model, the 
hyper-parameters at the highest level of the hierarchy could be interesting candidates, because 
their values strongly influence the typical values of the other components of the system. More 
research in this direction is however required to draw more definite conclusions. 



Acknowledgements 

Part of this work was done while the two last authors were participating to the program 
"Computational Mathematics" at the Haussdorff Institute for Mathematics in Bonn, Ger- 
many. Support from the ANR grants ANR-008-BLAN-0218 and ANR-09-BLAN-0216 of the 
French Ministry of Research is acknowledged. The authors thank Julien Cornebise, Arnaud 
Doucet, Peter J. Green, Pierre Jacob, Christian P. Robert, Gareth Roberts, and the referees 
for insightful remarks. 



References 

L. Ambrosio, N. Fusco, and D. Pallara. Functions of Bounded Variation and Free Disconti- 
nuity Problems. Oxford Science Publications, 2000. 

C. Andrieu and J. Thorns. A tutorial on adaptive MCMC. Statist. Comput., 18(4):343-373, 
2008. 

Y. F. Atchade and J. S. Liu. The Wang-Landau algorithm for Monte-Carlo computation in 
general state spaces. Stat. Sinica, 20(l):209-233, 2010. 

R. Balian. From Microphysics to Macrophysics. Methods and Applications of Statistical 
Physics, volume I - II. Springer, 2007. 

K. E. Basford, G. J. Mclachlan, and M. G. York. Modelling the distribution of stamp paper 
thickness via finite normal mixtures: The 1872 Hidalgo stamp issue of Mexico revisited. J. 
Appl. Stat, 24(2):169-180, 1997. 

G. Bussi, A. Laio, and M. Parrinello. Equilibrium free energies from nonequilibrium metady- 
namics. Phys. Rev. Lett, 96(9):090601, 2006. 



27 



G. Celeux, M. Hurn, and C. P. Robert. Computational and inferential difficulties with mixture 
posterior distributions. J. Am. Statist. Assoc., 95:957-970, 2000. 

C. Chipot and T. Lelievre. Enhanced sampling of multidimensional free-energy landscapes 
using adaptive biasing forces. arXiv preprint, 1008.3457, 2010. 

E. Darve and A. Pohorille. Calculating free energies using average force. J. Chem. Phys., 115 
(20):9169-9183, 2001. 

B.M. Dickson, F. Legoll, T. Lelivre, G. Stoltz, and P. Fleura-Lessard. Free energy calculations: 
An efficient adaptive biasing potential method. J. Phys. Chem. B, 114(17):5823-5830, 2010. 

J. Diebolt and C. P. Robert. Estimation of finite mixture distributions through Bayesian 
sampling. J. R. Statist. Soc. B, 56:363-375, 1994. 

L. C. Evans and R. F. Gariepy. Measure Theory and Fine Properties of Functions. Studies 
in Advanced Mathematics. CRC Press, 1992. 

S. Fruhwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer, 2006. 

S. Fruhwirth-Schnatter. Markov chain Monte Carlo estimation of classical and dynamic 
switching and mixture models. J. Am. Statist. Assoc., 96(453): 194-209, 2001. 

A. Gelman and X. L. Meng. Simulating normalizing constants: from importance sampling to 
bridge sampling to path sampling. Stat. Sci., 13(2): 163-185, 1998. 

W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. 
Biometrika, 57:97-109, 1970. 

J. Henin and C. Chipot. Overcoming free energy barriers using unconstrained molecular 
dynamics simulations. J. Chem. Phys., 121(7):2904-2914, 2004. 

Y. Iba. Extended ensemble Monte Carlo. Int. J. Modern Phys. C, 12(5):623-656, 2001. 

A. J. Izenman and C. J. Sommer. Philatelic mixtures and multimodal densities. J. Am. Stat. 
Assoc., 83(404) :941-953, 1988. 

A. Jasra, C. C. Holmes, and D. A. Stephens. Markov chain Monte Carlo methods and the 
label switching problem in Bayesian mixture modeling. Statist. Science, pages 50-67, 2005. 

B. Jourdain, T. Lelievre, and R. Roux. Existence, uniqueness and convergence of a particle 
approximation for the Adaptive Biasing Force process. ESAIM-Math. Model. Num., 44(5): 
831-865, 2010. 

A. Kong, J. S. Liu, and W. H. Wong. Sequential imputation and bayesian missing data 
problems. J. Am. Statist. Assoc., 89:278-288, 1994. 

T. Lelievre and K. Minoukadeh. Longtime convergence of an adaptive biasing force method: 
The bi-channel case. Arch. Ration. Mech. Anal, accepted, 2011. 

T. Lelievre, M. Rousset, and G. Stoltz. Computation of free energy profiles with parallel 
adaptive dynamics. J. Chem. Phys, 126:134111, 2007. 



28 



T. Lelievre, M. Rousset, and G. Stoltz. Long-time convergence of an adaptive biasing force 
method. Nonlinearity, 21:1155-1181, 2008. 

T. Lelievre, M. Rousset, and G. Stoltz. Free Energy Computations: A Mathematical Perspec- 
tive. Imperial College Press, 2010. 

F. Liang. A generalized Wang-Landau algorithm for Monte-Carlo computation. J. Am. Stat. 
Assoc., 100(472):1311-1327, 2005. 

F. Liang. Trajectory averaging for stochastic approximation MCMC algorithms. Ann. Stat., 
38(5):2823-2856, 2010. 

J. M. Marin and C. P. Robert. Bayesian core: a practical approach to computational Bayesian 
statistics. Springer Verlag, 2007. 

S. Marsili, A. Barducci, R. Chelli, P. Procacci, and V. Schettino. Self-healing Umbrella 
Sampling: A non-equilibrium approach for quantitative free energy calculations. J. Phys. 
Chem. B, 110(29):14011-14013, 2006. 

G. J. McLachlan and D. Peel. Finite mixture models. Wiley New York, 2000. 

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equations 
of state calculations by fast computing machines. J. Chem. Phys., 21 (6): 1087-1091, 1953. 

R. M. Neal. Annealed importance sampling. Statist. Comput., 11:125-139, 2001. 

S. Piana and A. Laio. A bias-exchange approach to protein folding. J. Phys. Chem. B., Ill 
(17):4553-4559, 2007. 

P. Raiteri, A. Laio, F. L. Gervasio, C. Micheletti, and M. Parrinello. Efficient reconstruction 
of complex free energy landscapes by multiple walkers metadynamics. J. Phys. Chem. B, 
110(8) :3533-3539, 2006. 

S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number 
of components. J. R. Statist. Soc. B, 59(4):731-792, 1997. 

C. P. Robert and G. Casella. Monte Carlo Statistical Methods, 2nd ed. Springer- Verlag, New 
York, 2004. 

D. M. Titterington, A. F. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture 
Distributions. Wiley, 1986. 

F. G. Wang and D. P. Landau. Determining the density of states for classical statistical 
models: A random walk algorithm to produce a flat histogram. Phys. Rev. E, 64(5): 
056101, 2001a. 

F. G. Wang and D. P. Landau. Efficient, multiple-range random walk algorithm to calculate 
the density of states. Phys. Rev. Lett, 86(10):2050-2053, 2001b. 



29 



